---
title: "Numbers Ninja Week 2: Data visualizations and joins"
author: |
  | Hari Subhash
  | Data Scientist @NRGI
date: "`r Sys.Date()`"
output:
  html_notebook:
    highlight: kate
    smart: yes
    theme: cosmo
  html_document:
    df_print: paged
    includes: assets/header.html
css: assets/custom.css
---
<div style= "float:right; position: relative; top: -80px; padding-left: 0px">
```{r, echo=FALSE, warning=FALSE, message=FALSE, out.width='10%'}
##add the icon
knitr::include_graphics("assets/images/ninja-logo2.jpg")
```
</div>

##Topics Covered

1. Visualizing data using ggplot2
2. Data Joins

<div class="full-width">

```{r, echo=FALSE}
knitr::include_graphics("assets/images/ian-dooley-407846-unsplash.jpg")
```

</div>

## Visualizing data using ggplot2
ggplot2 is a declarative plotting library for R. The "gg" in ggplot2 stands for "Grammar of Graphics", a book on the principles of data visualization that the package is based on. According to this [article](https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149):

>a grammar of graphics is a framework which follows a layered approach to describe and construct visualizations or graphics in a structured manner. A visualization involving multi-dimensional data often has multiple components or aspects, and leveraging this layered grammar of graphics helps us describe and understand each component involved in visualization — in terms of data, aesthetics, scale, objects and so on.

You can think about the layering of graphics in the same way as the composition that you learnt last week: building complex functionality by combining smaller pieces. Last week the pieces were functions chained with pipes (`%>%`); in ggplot2 they are layers added with `+`.

<div class="center-container">
```{r, echo=FALSE}
knitr::include_graphics("assets/images/layers of ggplot.png")
```
</div>
<small>**Source**: A Comprehensive Guide to the Grammar of Graphics for Effective Visualization of Multi-dimensional Data - Dipanjan Sarkar</small>

There are seven layers in ggplot2. Every plot needs at least three of them (the rest adopt default values if not specified):

1. A data layer
2. An aesthetics layer
3. A geom or statistics layer
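
As a minimal sketch of how these three layers fit together, the plot can be built up one layer at a time. The object names `canvas` and `scatter` are illustrative and not part of the original notebook.

```{r}
library(ggplot2)
library(gapminder)

## data + aesthetics layers: axes are set up, but nothing is drawn yet
canvas <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))

## adding a geom layer is what actually draws the observations
scatter <- canvas + geom_point()

scatter
```

Printing `canvas` on its own shows empty axes, which is a quick way to see what the geom layer contributes.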

Let's start plotting!

### Basic charts
In this section we will explore a few basic chart types. Let's also open the [reference website for ggplot2](https://ggplot2.tidyverse.org/reference/index.html) in a new tab so that we have it handy. We will be using the gapminder dataset to illustrate ggplot2 commands. This dataset is available as a package of the same name[^1]. Go ahead and install it on your computer. The gapminder dataset is displayed below. It provides data on life expectancy, population and GDP per capita for each country at five-year intervals from `r min(gapminder::gapminder$year)` to `r max(gapminder::gapminder$year)`.

```{r, message=FALSE, warning=FALSE}
rm(list = ls())
##load the packages
library(tidyverse); library(nycflights13); library(gapminder)
gapminder
```

### Scatterplot
The first plot we will explore is the scatter-plot. A scatter-plot places a point on a Cartesian (2-D) coordinate system for each pair of values of the x and y variables. Let's explore the relationship between year and life expectancy using a scatter-plot.

Go through the code below to identify the data, aesthetics and geom layers. As can be seen below, this is not a particularly useful chart. Each year in the data has many observations for life expectancy (one for each country), so the points pile up in vertical strips and it is difficult to perceive a clear trend. In the next step we will see how to improve this chart.

```{r}
##specify the data and aesthetics layer
ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
    ##specifying the geom type
    geom_point() +
    ##adding x and y axis titles
    labs(x = "Year", y = "Life Expectancy")
```

⚡***Ninja Tasks***⚡

1. Use a scatter-plot to explore the relationship between GDP per capita and life expectancy
2. Bonus question: Can you map the population to a particular aesthetic so that it can be displayed on the chart too?

🏆***Solution***🏆

#### 1. GDP per capita vs life expectancy
The gapminder data consists of one observation for each country for each year. We are interested in the relationship between GDP per capita and life expectancy. However, since each country appears once per year, we first average GDP per capita and life expectancy for each country over the entire period. This smooths out time trends and avoids over-plotting[^2], which lets us see the relationship between the two variables more clearly[^3].

```{r}
gapminder %>% 
    ##find the mean gdp per capita and life expectancy for each country
    group_by(country) %>% 
    summarise(meanGDPperCap = mean(gdpPercap, na.rm = T), 
              meanLifeExp = mean(lifeExp, na.rm = T)) %>% 
    ##specify the data and aesthetics layer
    ggplot(., aes(x = meanGDPperCap, y = meanLifeExp)) + 
    ##specify the geom
    geom_point(colour = "blue", size = 2.5, alpha = 0.8) +
    ##specify titles
    labs(x = "Mean GDP per capita", 
         y = "Mean Life Expectancy", 
         title = "Life Expectancy increases with GDP per capita")
```

The scatter-plot shows a single point for each country in the dataset, representing its mean life expectancy and mean GDP per capita. Make sure to read the code used to generate the chart carefully so that you understand it fully. Why are the `colour` and `size` parameters in `geom_point()` not inside an `aes()`?
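
One way to explore that question is to compare *setting* an aesthetic to a constant with *mapping* it to a variable. The sketch below is illustrative: the names `country_means`, `fixed_colour` and `mapped_colour` are assumptions and not part of the original notebook, and the `.groups` argument assumes dplyr 1.0.0 or later.

```{r}
library(dplyr)
library(ggplot2)
library(gapminder)

## one row per country; continent is kept so that it can be mapped to colour
country_means <- gapminder %>%
    group_by(country, continent) %>%
    summarise(meanGDPperCap = mean(gdpPercap),
              meanLifeExp = mean(lifeExp),
              .groups = "drop")

## outside aes(): every point gets the same constant colour and no legend is drawn
fixed_colour <- ggplot(country_means, aes(x = meanGDPperCap, y = meanLifeExp)) +
    geom_point(colour = "blue")

## inside aes(): the colour is looked up from the continent column and a legend is added
mapped_colour <- ggplot(country_means, aes(x = meanGDPperCap, y = meanLifeExp)) +
    geom_point(aes(colour = continent))

fixed_colour
mapped_colour
```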

#### 2. Map an additional aesthetic to the plot

Let's map the size of each point to the `meanPop` variable. I also introduce an additional aesthetic mapping of colour to the `continent` variable. In addition, since GDP per capita spans several orders of magnitude, we can plot the log of the mean GDP per capita to spread out the values (<span class="highlight">[watch](#logs)</span> the video below if you don't fully understand logarithmic scales).

We can do this in two ways. The first is to mutate `meanGDPperCap` by taking its log and plotting that (try this yourself)[^4]. The second is to use ggplot2's built-in `scale_x_log10()` function to transform the plotting scale instead of the variable. While the charts would largely look the same, the second option preserves the actual variable and plots it on the new scale. Plot the first option on your own to see if you can spot the difference between the plot below and that one.

Applying the log scale makes it appear as if the relationship between GDP per capita and life expectancy is linear. <span class="highlight">This is not true</span>: as we observed in the previous chart, the relationship is curved and flattens out at higher incomes. As GDP per capita increases, the corresponding gain in life expectancy gets smaller (probably because we can't extend human lifespans beyond a certain limit, no matter how hard we try). This is one reason why we should be extremely careful with scale transformations. We should only use them if we have a clear reason for why they are necessary, and when we do choose to use them we should have a clear sense of what they are doing to the interpretation of the chart.
```{r}
gapminder %>% 
    ##Calculate country level information
    group_by(country) %>% 
    summarise(meanGDPperCap = mean(gdpPercap, na.rm = T), 
              meanLifeExp = mean(lifeExp, na.rm = T), 
              meanPop = mean(pop, na.rm = T),
              continent = unique(continent)) %>% 
    ##specify data and aesthetics
    ggplot(., aes(x = meanGDPperCap, 
                  y = meanLifeExp, 
                  size = meanPop, 
                  colour = continent)) + 
    ##specify the geom
    geom_point(alpha = 0.6) +
    ##apply scale transformation
    scale_x_log10() +
    ##specify the titles
    labs(x = "Mean GDP per capita", 
         y = "Mean Life Expectancy", 
         title = "Life Expectancy increases with GDP per capita", 
         size = "Mean population",
         colour = "Continent") +
    ##adjust the position and layout of the legends
    theme(legend.position = "top",
          legend.box = "vertical",
          legend.spacing.y = unit(-8, "pt"))
```

<div class="center-container">
<a name="logs"><iframe width="720" height="405" src="https://www.youtube.com/embed/sBhEi4L91Sg" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe></a>
</div>
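
A rough numerical check of the point about the log scale, offered as a sketch and not a formal test: the Pearson correlation measures how close a relationship is to a straight line, so we can compare it for raw and log-transformed GDP per capita. The object names `country_means`, `cor_raw` and `cor_log` are illustrative.

```{r}
library(dplyr)
library(gapminder)

country_means <- gapminder %>%
    group_by(country) %>%
    summarise(meanGDPperCap = mean(gdpPercap),
              meanLifeExp = mean(lifeExp))

## strength of the straight-line relationship on the original scale
cor_raw <- cor(country_means$meanGDPperCap, country_means$meanLifeExp)

## the same measure after taking log10 of GDP per capita
cor_log <- cor(log10(country_means$meanGDPperCap), country_means$meanLifeExp)

c(raw = cor_raw, log10 = cor_log)
```

A noticeably higher correlation on the log scale is consistent with a relationship that flattens out at high incomes: a straight line on the log scale is a curve on the original scale.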

### Line chart
Now let's use dplyr to calculate the average global life expectancy by year and plot it as a line. As can be seen below, the average global life expectancy has been increasing over the years.

```{r}
gapminder %>% 
    ##group by year to calculate the yearly average life expectancy
    group_by(year) %>% 
    summarise(meanLifeExp = mean(lifeExp, na.rm = T)) %>% 
    ##draw a line marking the trend of life expectancy over years
    ggplot(., aes(x = year, y = meanLifeExp)) +
    geom_line() +
    labs(title = "Average life expectancy has been increasing", x = "Years", y = "Mean Life Expectancy")
```


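Now let's go a bit further and explore the trend in life expectancy for individual countries. Note that `group = country` draws one line per country, while `colour = continent` colours each line by its continent.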
```{r}
ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
    ##colour and group are mapped to continent and country, and alpha is set to 0.5
    geom_line(mapping = aes(colour = continent, group = country), alpha = 0.5) +
    labs(title = "Countries in Africa have lower life expectancy", x = "Year", y = "Life Expectancy") +
    theme(
        legend.position = "top"
        
    )
```

In the chart above there are a few countries where life expectancy drops suddenly. These drops coincide with the genocides in Rwanda and Cambodia and with the famine of 1959 to 1961 in China. The chart below uses colour and alpha to highlight these countries. Notice how I use the `alphaMapping` and `colorMapping` variables to fix the alpha and colour aesthetics in the chart.[^5]


```{r}
##repeat the same plot with one line per country, highlighting the three countries with sudden drops
alphaMapping <- if_else(gapminder$country %in% c("Rwanda", "Cambodia", "China"), 0.8, 0.1)
colorMapping <- if_else(gapminder$country %in% c("Rwanda", "Cambodia", "China"), "darkred", "black")

ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, group = country)) +
    geom_line(alpha = alphaMapping, colour = colorMapping) +
    labs(title = "The impact of genocides and famine on life expectancy", x = "Year", y = "Life Expectancy")
```
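
An alternative to passing external vectors, sketched here for comparison and not the approach used above, is to add a flag column to the data and map it inside `aes()`, then choose the colours and transparencies with manual scales. The names `highlighted`, `gapminder_flagged` and `highlight_plot` are illustrative.

```{r}
library(dplyr)
library(ggplot2)
library(gapminder)

highlighted <- c("Rwanda", "Cambodia", "China")

## flag the rows that belong to the highlighted countries
gapminder_flagged <- gapminder %>%
    mutate(highlight = country %in% highlighted)

highlight_plot <- ggplot(gapminder_flagged,
                         aes(x = year, y = lifeExp, group = country,
                             colour = highlight, alpha = highlight)) +
    geom_line() +
    ## pick the colour and transparency used for each value of the flag
    scale_colour_manual(values = c("TRUE" = "darkred", "FALSE" = "black"), guide = "none") +
    scale_alpha_manual(values = c("TRUE" = 0.8, "FALSE" = 0.1), guide = "none") +
    labs(x = "Year", y = "Life Expectancy")

highlight_plot
```

Keeping the flag in the data frame means the highlighting stays attached to the rows if the data is later filtered or reordered.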


⚡***Ninja Tasks***⚡

1. Draw a line chart showing the trend for average GDP per capita through time for each continent

🏆***Solution***🏆
The chart below shows that, on average, poorer countries in Africa and Asia are still lagging behind those in the West, which is not what the convergence hypothesis predicts. According to [Wikipedia](https://en.wikipedia.org/wiki/Convergence_(economics)), convergence can be described as follows:

>The idea of convergence in economics (also sometimes known as the catch-up effect) is the hypothesis that poorer economies' per capita incomes will tend to grow at faster rates than richer economies. As a result, all economies should eventually converge in terms of per capita income.

```{r}
gapminder %>% 
    group_by(continent, year) %>% 
    summarise(meanGDP = mean(gdpPercap, na.rm = T)) %>% 
    ggplot(., aes(x = year, y = meanGDP)) +
    geom_line(aes(colour = continent)) +
    labs(x = "Year", y = "Mean GDP per Capita", title = "Convergence where art thou?", colour = "Continent") +
    theme(legend.position = "top")
```



### Histogram
Watch the video [below](#hist) to refresh your understanding of histograms. The plot below shows the histogram of `gdpPercap` in the gapminder dataset. There is, however, a problem with this chart: it pools every country-year observation, so it mixes the spread of GDP per capita across countries with its change over time. We are interested in the spread across countries, not across time. In this case we are better off filtering down to a single year and observing the distribution of GDP per capita across countries. Let's try this out with the training exercise.
```{r}
ggplot(data = gapminder, aes(gdpPercap)) +
    geom_histogram(bins = 30)
```
<div class="center-container">
<a id="hist"><iframe width="560" height="315" src="https://www.youtube.com/embed/gSEYtAjuZ-Y" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe></a>
</div>
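
As a sketch of the single-year idea applied to GDP per capita, the data can be filtered to the latest year before plotting. The names `latest_year` and `gdp_hist` are illustrative.

```{r}
library(dplyr)
library(ggplot2)
library(gapminder)

## keep one observation per country by restricting to the latest year
latest_year <- gapminder %>%
    filter(year == max(year))

gdp_hist <- ggplot(latest_year, aes(x = gdpPercap)) +
    geom_histogram(bins = 30) +
    labs(x = "GDP per capita", y = "Number of countries")

gdp_hist
```

Each bar now counts countries, not country-year pairs, so the bar heights add up to the number of countries in the dataset.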

⚡***Ninja Tasks***⚡

1. Draw a histogram to show the distribution of population

🏆***Solution***🏆

For this chart we select the latest year in the dataset and plot the distribution of population. Can you think of the pros and cons of this method compared with calculating the average population for each country over the entire dataset and plotting that?

```{r}
gapminder %>% 
    filter(year == max(year, na.rm = T)) %>% 
    ggplot(., aes(pop)) +
    geom_histogram(bins = 60) +
    labs(x = "Population", y = "Count")
```

### Bar chart
The chart below shows the average life expectancy of each continent for the most recent year in the data. Can you think of a reason why it might be slightly better to consider only the most recent year when calculating the average life expectancy for a continent? Also notice how the bars are ordered from shortest to tallest. Can you find out how I achieved this?
```{r}
##draw a bar chart with average life expectancy in different continents
gapminder %>% 
    filter(year == max(year, na.rm = T)) %>% 
    group_by(continent) %>% 
    summarise(meanLifeExp = mean(lifeExp, na.rm = T)) %>% 
    ggplot(., aes(x = reorder(continent, meanLifeExp), y = meanLifeExp)) +
    geom_col() +
    labs(y = "Mean life expectancy") +
    theme(
        axis.title.x = element_blank()
    )
    
```
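
To see what `reorder()` does to the x variable, it can help to inspect the factor levels before and after. This is a small sketch; the names `continent_means` and `by_life_exp` are illustrative.

```{r}
library(dplyr)
library(gapminder)

continent_means <- gapminder %>%
    filter(year == max(year)) %>%
    group_by(continent) %>%
    summarise(meanLifeExp = mean(lifeExp))

## default factor order: the order in which the levels are stored
levels(continent_means$continent)

## reorder() returns the same factor with its levels sorted by the second argument
by_life_exp <- reorder(continent_means$continent, continent_means$meanLifeExp)
levels(by_life_exp)
```

ggplot2 draws discrete axes in level order, which is why reordering the levels changes the order of the bars.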

⚡***Ninja Tasks***⚡

1. Draw a bar chart showing the average lifeExp across years

🏆***Solution***🏆
```{r}
gapminder %>% 
    group_by(year) %>% 
    summarise(meanLifeExp = mean(lifeExp, na.rm = T)) %>% 
    ggplot(data = ., aes(x = year, y = meanLifeExp)) +
    geom_col()
    
```

The same approach also works for mean GDP per capita:

```{r}
gapminder %>% 
    group_by(year) %>% 
    summarise(meanGDP = mean(gdpPercap, na.rm = T)) %>% 
    ggplot(., aes(x = year, y = meanGDP)) +
    geom_col()
```

### Faceting

```{r}
##reuse some of the plots above as a faceted plot
ggplot(gapminder, aes(x = year, y = lifeExp)) +
    geom_jitter(width = 1.5, alpha = 0.6) +
    facet_wrap(~continent)
```


⚡***Ninja Tasks***⚡

1. Show the relationship between GDP per capita and Life Expectancy using a faceted chart of your choice
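
One possible approach to this task, offered as a sketch and not the only valid answer, is a scatter-plot with one panel per continent. The name `faceted_scatter` is illustrative.

```{r}
library(ggplot2)
library(gapminder)

faceted_scatter <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.4) +
    ## the log scale spreads out the many low-income observations
    scale_x_log10() +
    ## one panel per continent
    facet_wrap(~continent) +
    labs(x = "GDP per capita (log scale)", y = "Life Expectancy")

faceted_scatter
```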




###Layering multiple geoms


```{r}
##draw a line of fit in a chart of population and life expectancy
gapminder %>% 
    group_by(country) %>% 
    summarise(continent = unique(continent), meanPop = mean(pop, na.rm = TRUE), meanLifeExp = mean(lifeExp, na.rm = TRUE)) %>% 
    filter(meanPop < 5e7) %>% 
    ggplot(., aes(x = meanPop, y = meanLifeExp)) +
    geom_jitter() + 
    geom_smooth(method = "lm", se = FALSE)
```


⚡***Ninja Tasks***⚡

1. Add a line of best fit to a scatter plot showing the relationship between GDP per capita and life expectancy
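
A possible sketch for this task, assuming we reuse the country-level means from the scatter-plot section. The names `country_means` and `fit_plot` are illustrative.

```{r}
library(dplyr)
library(ggplot2)
library(gapminder)

country_means <- gapminder %>%
    group_by(country) %>%
    summarise(meanGDPperCap = mean(gdpPercap),
              meanLifeExp = mean(lifeExp))

fit_plot <- ggplot(country_means, aes(x = meanGDPperCap, y = meanLifeExp)) +
    ## first geom: the observations
    geom_point(alpha = 0.6) +
    ## second geom: a least-squares line, fitted after the scale transformation,
    ## so it is a straight line in log10(GDP per capita)
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10() +
    labs(x = "Mean GDP per capita (log scale)", y = "Mean Life Expectancy")

fit_plot
```

Removing `scale_x_log10()` fits the line on the original scale instead, which makes the flattening at high incomes easier to see.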

###Adding summary stats to plots


```{r}
##explore gdp per capita over the years using a jitter plot and stat_summary (rule of thumb, always stay as close to the data as possible)

    
```




```{r}
ggplot(gapminder, aes(x = year, y = gdpPercap)) +
    geom_jitter()
```
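
As a sketch of how a summary statistic could be layered on top of this jitter plot, `stat_summary()` can compute the mean for each year and draw it over the raw points. The name `summary_plot` is illustrative, and the `fun` argument assumes ggplot2 3.3.0 or later.

```{r}
library(ggplot2)
library(gapminder)

summary_plot <- ggplot(gapminder, aes(x = year, y = gdpPercap)) +
    ## raw observations, jittered sideways so points in the same year overlap less
    geom_jitter(width = 1.5, alpha = 0.3) +
    ## one red point per year showing the mean GDP per capita
    stat_summary(fun = mean, geom = "point", colour = "red", size = 3)

summary_plot
```

Showing the summary together with the raw points keeps the chart close to the data, in line with the rule of thumb above.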

```{r}
##Using vline and hline

```



⚡***Ninja Tasks***⚡

1. Add a vertical line showing the mean of population to the histogram showing its distribution
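
One possible approach to this task, sketched under the assumption that we use the latest year as in the histogram solution. The names `latest_year`, `mean_pop` and `pop_hist` are illustrative.

```{r}
library(dplyr)
library(ggplot2)
library(gapminder)

latest_year <- gapminder %>%
    filter(year == max(year))

## the value at which the reference line will be drawn
mean_pop <- mean(latest_year$pop)

pop_hist <- ggplot(latest_year, aes(x = pop)) +
    geom_histogram(bins = 60) +
    ## vertical reference line at the mean population
    geom_vline(xintercept = mean_pop, colour = "red", linetype = "dashed") +
    labs(x = "Population", y = "Count")

pop_hist
```

`geom_hline()` works the same way for horizontal lines, using `yintercept` in place of `xintercept`.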


###Add some style


```{r}
##Use the plot from the previous section and add some pizazz (explore ggthemes, legend position etc)
```
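
A minimal styling sketch, using one of ggplot2's built-in themes so that no extra package is needed (ggthemes offers more options). The plot and the name `styled_plot` are illustrative and not taken from the previous section.

```{r}
library(dplyr)
library(ggplot2)
library(gapminder)

styled_plot <- gapminder %>%
    filter(year == max(year)) %>%
    ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
    geom_point(alpha = 0.7) +
    scale_x_log10() +
    labs(x = "GDP per capita (log scale)", y = "Life Expectancy", colour = "Continent") +
    ## start from a complete built-in theme ...
    theme_minimal() +
    ## ... then adjust individual elements; tweaks must come after the complete theme
    theme(legend.position = "top")

styled_plot
```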




###Add some more complexity (based on time)

####Coords

####Scales


##Data Joins

dplyr provides six join functions:

1. `inner_join()`: return <span class="highlight">all rows from x where there are matching values in y</span>, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
2. `left_join()`: return <span class="highlight">all rows from x</span>, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
3. `right_join()`: return <span class="highlight">all rows from y</span>, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
4. `full_join()`: return <span class="highlight">all rows</span> and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
5. `semi_join()`: return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, whereas a semi join will never duplicate rows of x.
6. `anti_join()`: return all rows from x where there are no matching values in y, keeping just columns from x.
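
The difference between mutating joins and filtering joins is easiest to see when keys are duplicated. The sketch below uses two small made-up tables; `x`, `y` and their contents are hypothetical. Recent versions of dplyr may warn about a many-to-many relationship on the inner join, which is expected here.

```{r}
library(dplyr)

## two toy tables (hypothetical data) that share the key column `id`
x <- tibble(id = c(1, 1, 2), x_val = c("a", "b", "c"))
y <- tibble(id = c(1, 1, 3), y_val = c("p", "q", "r"))

## id 1 appears twice on each side, so the inner join returns 2 x 2 = 4 rows
inner <- inner_join(x, y, by = "id")

## the semi join only filters x: the two id-1 rows are kept once each
semi <- semi_join(x, y, by = "id")

## the anti join keeps the rows of x with no match in y (id 2)
anti <- anti_join(x, y, by = "id")

c(inner = nrow(inner), semi = nrow(semi), anti = nrow(anti))
```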

```{r}
band_members
```

```{r}
band_instruments
```
```{r}
inner_join(band_members, band_instruments)
```



```{r}
left_join(band_members, band_instruments)
```


```{r}
right_join(band_members, band_instruments)
```

```{r}
full_join(band_members, band_instruments)
```



```{r}
semi_join(band_members, band_instruments)
```


```{r}
anti_join(band_members, band_instruments)
```
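
To compare the six joins at a glance, one option is to count the rows each of them returns for the two example tables. This is a sketch: the name `join_rows` is illustrative, and the `by` argument is spelled out here to make the join key explicit.

```{r}
library(dplyr)

## number of rows returned by each join of the two example tables
join_rows <- c(
    inner = nrow(inner_join(band_members, band_instruments, by = "name")),
    left  = nrow(left_join(band_members, band_instruments, by = "name")),
    right = nrow(right_join(band_members, band_instruments, by = "name")),
    full  = nrow(full_join(band_members, band_instruments, by = "name")),
    semi  = nrow(semi_join(band_members, band_instruments, by = "name")),
    anti  = nrow(anti_join(band_members, band_instruments, by = "name"))
)

join_rows
```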


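### Left Join
This is the most commonly used type of join (analogous to VLOOKUP in Excel): every row of the left table is kept, and matching values from the right table are looked up and added as new columns.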
```{r}
library(nycflights13)

flightsWithWeather <- left_join(flights, weather, by = c("origin", "month", "day", "hour"))
```


```{r}
weather
```




[^1]: The gapminder package is an excerpt of the data that exists [here](https://www.gapminder.org/data/). This data was created by the Gapminder Foundation led by Hans Rosling.
[^2]: Over-plotting happens when data points are plotted on top of each other multiple times. It makes charts more confusing and difficult to read, and should be avoided as far as possible.
[^3]: We could also do this by filtering down to the most recent year to study the relationship. Both are valid options. The former has the benefit of capturing more information, since it uses the entire time series to calculate the mean values, while the latter disregards all but one year of the data. In this case averaging suits our purpose of studying the relationship between GDP per capita and life expectancy.
[^4]: You can do this using the following command `ggplot(., aes(x = log(meanGDPperCap, base = 10), y = meanLifeExp))` instead of `ggplot(., aes(x = meanGDPperCap, y = meanLifeExp))`. Can you tell the difference between this plot and the one that was made using the scale transformation?
[^5]: These aesthetic parameters are not being mapped to the data. Instead we are using vectors that were created outside the current function call to specify the values for these aesthetics.
